Reconciling Multiple Connectivity Scores for Drug Repurposing

Kewalin Samart https://github.com/KewalinSamart (Mathematics, Michigan State University)https://github.com/orgs/JRaviLab , Phoebe Tuyishime https://github.com/phoebetuyishime (Food Science and Nutrition, Michigan State University)https://github.com/orgs/JRaviLab , Arjun Krishnan https://www.thekrishnanlab.org/ (Computational Mathematics, Science, and Engineering; Biochemistry and Molecular Biology, Michigan State University)https://cmse.msu.edu/directory/faculty/ , Janani Ravi https://jravilab.github.io (Pathobiology and Diagnostic Investigations, Michigan State University)https://cvm.msu.edu/directory/ravi
2020-10-03

Abstract

The basis of several recent methods for drug repurposing is the key principle that an efficacious drug will reverse the disease molecular ‘signature’ with minimal side-effects. This principle was defined and popularized by the influential ‘connectivity map’ study in 2006 regarding reversal relationships between disease- and drug-induced gene expression profiles, quantified by a disease-drug ‘connectivity score.’ Over the past 14 years, several studies have proposed variations in calculating connectivity scores towards improving accuracy and robustness in light of massive growth in reference drug profiles. However, these variations have been formulated inconsistently using varied notations and terminologies even though they are based on a common set of conceptual and statistical ideas. Therefore, we present a systematic reconciliation of multiple disease-drug connectivity scores by defining them using consistent notation and terminology. In addition to providing clarity and deeper insights, this coherent definition of connectivity scores and their relationships provides a unified scheme that newer methods can adopt, enabling the computational drug-development community to compare and benchmark different approaches easily. To facilitate the continuous and transparent integration of newer methods, this review will be available as a live document at https://jravilab.github.io/connectivity_score_review coupled with a GitHub repository https://github.com/jravilab/connectivity_score_review that any researcher can build on and push changes to.

Introduction

The manifestation of a disease or perturbation by a small molecule in a tissue leaves a characteristic imprint (a “signature”) in its gene expression profile (Huang 2019). These signatures, recorded for thousands of diseases and drugs, form the basis of a powerful and widely adopted method for drug repurposing called “drug-disease connectivity analysis” (Alexandra B. Keenan and Ma’ayan 2019). In this analysis, novel drug indications for a specific disease of interest are identified based on the extent to which the ranked drug-gene signature is a “reversal” of the disease gene signature (Dudley 2011; Sirota 2011; Fig. 1). Connectivity-based drug repurposing has been used to discover drugs in various cancers and non-cancer diseases (Vineela Parvathaneni 2019).

Figure 1. Drug-disease connectivity. A. Gene expression signatures. \(L\) is a rank-ordered drug gene expression signature going from the most significantly up-regulated genes to the most significantly down-regulated genes. \(S\) is the gene set for the disease of interest with \(S_{up}\) containing the set of up-regulated genes and \(S_{down}\) containing the set of down-regulated genes. B. Connectivity. Positions of \(S_{up}\) and \(S_{down}\) disease genes in the ranked drug list, \(L\), determine the signs of enrichment scores (\(ES\); \(ES_{up}\), \(ES_{down}\)). Positive connectivity is defined as the case when the disease signature and drug profile show similar perturbations, i.e. when \(ES_{up}\) is positive and/or when \(ES_{down}\) is negative. This happens when \(S_{up}\) predominantly appears towards the top of the drug profile or when \(S_{down}\) appears predominantly towards the bottom of the drug profile (scenarios 1 and 4). Conversely, negative connectivity is defined as the case when the disease signature and drug profile show dissimilar perturbations, i.e. when \(ES_{up}\) is negative and/or when \(ES_{down}\) is positive. This happens when \(S_{up}\) predominantly appears towards the bottom of the drug profile or when \(S_{down}\) appears towards the top of the drug profile (scenarios 2 and 3). Negative connectivity indicates drug reversal of disease signature.

From its inception in 2006, the method for connectivity analysis has evolved through a series of proposed modifications over the past decade and a half (Fig. 2A). The first method for connectivity analysis (Justin Lamb 2006) builds on the seminal paper by Subramanian et al., 2005 (Aravind Subramanian and Mesirov 2005) that proposed the Gene Set Enrichment Analysis (GSEA) method. GSEA uses a modified Kolmogorov-Smirnov statistic (Myles Hollander 1999), referred to as the “enrichment score” (\(ES\)), to evaluate if genes in a certain pathway appear towards the top or bottom of a (differential) gene expression profile. Lamb et al., 2006 (Justin Lamb 2006) built a reference database (CMap, which we refer to as CMap 1.0 in this review) with gene expression profiles for 164 small molecules and proposed the first method for connectivity analysis based on GSEA. This method compares a query signature (disease) to each of the ranked drug-gene expression profiles in the reference database and ranks all the drugs based on their connectivity scores. A connectivity score ranges between -1 (indicating a complete ‘drug-disease’ reversal) and +1 (indicating perfect ‘drug-disease’ similarity). Another study adapted this connectivity score calculation and used it to find compounds in the LINCS L1000 collection (Qiaonan Duan 2014) that could be repurposed for three cancer types (Bin Chen 2017). This study quantified the reversal relationship between the drug and disease by computing the Reverse Gene Expression Score (\(RGES\)). Finally, CMap 1.0 itself was further updated by expanding the LINCS L1000 collection to more than 1.3 million profiles (Aravind Subramanian and Golub 2017) (referred to as CMap 2.0 in this review). Along with this expansion of data, the CMap 2.0 study also proposed another variation of the connectivity score, called the weighted connectivity score, that uses GSEA’s weighted Kolmogorov-Smirnov enrichment statistic, along with ways to normalize the resulting scores and correct them further to account for background associations.

A taxonomy of connectivity scores

Figure 2. A taxonomy of connectivity scores. A. Relationship between connectivity scores. The main formulations discussed here are GSEA enrichment score (\(ES\)) (Aravind Subramanian and Mesirov 2005), CMap 1.0 connectivity score (\(CS\)) (Justin Lamb 2006), \(RGES\) and \(sRGES\) (Bin Chen 2017), CMap 2.0 weighted connectivity score (\(WCS\)), normalized connectivity score (\(NCS\)), and Tau score (\(\tau\)) (Aravind Subramanian and Golub 2017). B. Detailed definitions of connectivity scores in A.

Connectivity scores and methodologies have been evaluated in the past to assess their performance in predicting drug-drug or drug-disease relationships. The performance of CMap 1.0 was evaluated in predicting drug-drug relationships using the Anatomical Therapeutic Chemical classification (Iskar 2010; Cheng J. 2013), and in predicting drug-disease relationships (Cheng 2014). Furthermore, a recent review (Musa 2018) assessed advances that have been made in CMap 1.0 and computational tools that have been applied in the drug repurposing and discovery fields. Lin et al., 2019 (Lin 2019) further evaluated connectivity approaches that use L1000 data (Qiaonan Duan 2014), including six different scores that are used to predict drug-drug relationships.

All these proposed variations of the connectivity score share a common set of conceptual and statistical ideas. Yet, they have been formulated inconsistently using varied notations and terminologies in the original papers and in the aforementioned evaluation studies. This lack of consistency in notation makes it difficult to understand the subtle differences between the scores and the intuition underlying each one. For example, the connectivity score referred to as “\(RGES\)” (Bin Chen 2017) directly builds on “\(CS\)” (Justin Lamb 2006). Another example is the “\(WCS\)” in (Aravind Subramanian and Golub 2017), which is a bi-directional weighted version of the “\(KS(ES)\)” used in GSEA (Aravind Subramanian and Mesirov 2005). In both cases, the scores are named and notated quite differently though they are essentially direct, simple variants of each other. In this review, we develop a systematic scheme that defines the aforementioned methodologies using consistent notations and terms. Additionally, we provide summary tables throughout the article to relate our consistent scheme to the previously published ones.

We begin by creating a standardized set of notations and terms to denote the various concepts and quantities required to define the different connectivity scores. A connectivity score between a disease and a drug is computed by comparing the genes up- (\(S_{up}\)) and down-regulated (\(S_{down}\)) by the disease (compared to a healthy control) to a ranked list of genes (\(L\)) ordered based on their differential expression in response to a drug. A good connectivity score is usually a strongly negative value since it is designed to indicate a reversal relationship between the disease and the drug. Such a score is usually achieved when genes in \(S_{up}\) appear at the bottom of \(L\) and/or when genes in \(S_{down}\) appear at the top of \(L\). When there is no relationship, or when \(S_{up}\) appears at the top and/or \(S_{down}\) appears at the bottom of \(L\) (i.e., similarity between the disease and drug signatures), the drug is unlikely to be efficacious in treating that disease. These scenarios are depicted in Figure 1, and the general notations, which we use throughout this work, are presented in Table 1 and Figure 1.

Building on these general notations and terms, in the rest of this review, we develop and present a systematic scheme that defines four formulations of the drug-disease connectivity scores using consistent notations and terms, detailed formulation, and a summary table that will enable researchers to relate our consistent scheme back to the notations and terminology used in the original publications.

Table 1. General Notations

| Notation | Description |
|---|---|
| \(S\) | disease gene set (i.e., query) (Fig. 1). Without any loss of generality, only the subset of disease genes that are also part of \(L\) is considered throughout (i.e., \(S \subseteq L\)). |
| \(S_{up}\) | disease up-regulated gene set; \(S_{up} \subseteq S\) |
| \(S_{down}\) | disease down-regulated gene set; \(S_{down} \subseteq S\); \(S = S_{up} \cup S_{down}\) |
| \(L\) | rank-ordered list (drug) (Fig. 1) |
| \(N_{S}\) | number of genes in \(S\) |
| \(N_{L}\) | number of genes in \(L\) |
| \(gl_{i}\), \(gs_{i}\) | \(i^{th}\) gene in list \(L\) or set \(S\), respectively |
| \(idx(L,gs_{i})\) | index of gene \(gs_{i}\) in list \(L\) |
| \(t\) | each treatment instance (i.e., a treated-and-vehicle-control pair) that results in a single drug profile \(L\) |
| \(N_{D}\) | total number of drug profiles (\(L\)) in the reference database |
| \(N_{d}\) | number of drug profiles (\(L\)) in the reference database that correspond to a specific drug \(d\) |
| KS | Kolmogorov-Smirnov |
| \(ES\) | enrichment score |
| \(ES_{up}\) | \(ES\) for up-regulated gene set (\(S_{up}\)) |
| \(ES_{down}\) | \(ES\) for down-regulated gene set (\(S_{down}\)) |

Gene Set Enrichment Analysis (GSEA)

All connectivity scores described here begin with the calculation of some form of an Enrichment Score (\(ES\)) that captures the relationship between a drug and a disease. The basis of all these \(ES\) formulations is Gene Set Enrichment Analysis (GSEA) (Aravind Subramanian and Mesirov 2005), which was originally developed to assess the enrichment (over-representation) of predefined biological gene sets (e.g., pathways, targets of a regulator, etc.) at the top or bottom of a list of genes ranked by their extent of differential expression in response to an experimental factor of interest. Enriched gene sets are then hypothesized to be biologically relevant to that experimental factor. When adapted to the question of drug repurposing, a method like GSEA can be used to assess the enrichment of sets of genes associated with a disease at the top or bottom of a list of genes ranked by their extent of differential expression in response to a drug (Fig. 1).

Enrichment Score (ES)

GSEA is a weighted signed version of the classical Kolmogorov-Smirnov test. It takes two inputs: i) a disease gene set composed of a set of genes significantly perturbed in response to a disease (denoted \(S\)), and ii) a rank-ordered list (\(L\)) of drug genes (in decreasing order of a drug-response score \(d(gl_{j})\) for each gene \(gl_{j}\)). Using these two inputs, GSEA quantifies the level of association between the disease and the drug by calculating an enrichment score (\(ES\)) based on the following steps:

  1. For each position \(i\) in the rank-ordered list (\(L\)) from top to bottom,

        1.1. calculate the weighted fraction of genes in \(S\) (“hits”) at or above position \(i\): \[ P_{hit}(S,i) = \sum_{\substack{gl_{j} \in S \\ j \leq i}}\frac{|d(gl_{j})|^{w_{ES}}}{N_{|(d)|}},\qquad \text{where} \qquad N_{|(d)|} = \sum_{gl_{j} \in S}|d(gl_{j})|^{w_{ES}} \]

        1.2. calculate the fraction of genes not in \(S\) (“misses”) at or above position \(i\): \[ P_{miss}(S,i) = \sum_{\substack{gl_{j} \notin S \\ j \leq i}}\frac{1}{N_{L}-N_{S}} \]

        1.3. calculate the positional enrichment score (\(es_{i}\)): \[ es_{i} = P_{hit}(S,i) - P_{miss}(S,i) \]

  2. Finally, calculate the enrichment score (\(ES\)) as the positional enrichment score with the maximum deviation from zero: \[ ES = es_{i^{*}}, \qquad \text{where} \qquad i^{*} = \arg\max_{i}|es_{i}| \]

When \(w_{ES}=0\), \(N_{|(d)|} = \displaystyle\sum_{\substack{gl_{j} \in S}}|d(gl_{j})|^{0} = N_{S}\), which results in \[P_{hit}(S,i) = \sum_{\substack{gl_{j} \in S \\ j \leq i}}\frac{|d(gl_{j})|^{0}}{N_{|(d)|}} = \sum_{\substack{gl_{j} \in S \\ j \leq i}}\frac{1}{N_{S}}.\] Thus, \(P_{hit}(S,i)\) and \(P_{miss}(S,i)\) are the empirical distribution functions of the positions of the disease genes (i.e., \(S\)) and of the non-disease genes (i.e., \(L-S\)), respectively, in the drug gene list \(L\). Therefore, when \(w_{ES}=0\), \(ES\) (the signed maximum distance between the two functions) reduces to a signed two-sample Kolmogorov-Smirnov (KS) statistic: \[ ES = es_{i^{*}} = sign(es_{i^{*}}) \times KS \] where \(i^{*}\) is the position at which \(|P_{hit}(S,i)-P_{miss}(S,i)|\) is largest and \[ KS = \max_{i}|F_{S}(i)-F_{L-S}(i)| \] is the classical two-sample KS statistic, with \(F_{S}\) and \(F_{L-S}\) being the empirical distribution functions of \(S\) and \(L-S\), respectively, defined as follows:

\[ F_{S}(i) = \frac{1}{N_{S}}\sum_{\substack{j=1 \\ gl_{j} \in S}}^{N_{L}} 1_{j \leq i} ,\qquad F_{L-S}(i) = \frac{1}{N_{L}-N_{S}}\sum_{\substack{j=1 \\ gl_{j} \notin S}}^{N_{L}} 1_{j \leq i} \]

When \(w_{ES}=1\), \(ES\) becomes a weighted signed two-sample KS statistic in which each disease gene at position \(j\) in the drug gene list \(L\) is weighted by the magnitude of its drug-response score, \(|d(gl_{j})|\). Setting \(w_{ES}\) to one is recommended for GSEA. We point the reader to the original GSEA publication (Aravind Subramanian and Mesirov 2005) for a discussion of the statistic when \(w_{ES}\) is set to values less than or greater than one.
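The running-sum calculation described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference GSEA implementation: the function name `enrichment_score` and its arguments are hypothetical, and \(ES\) is taken as the positional score with the largest absolute deviation from zero, as in the original GSEA publication.

```python
from typing import Sequence, Set


def enrichment_score(ranked_genes: Sequence[str],
                     scores: Sequence[float],
                     gene_set: Set[str],
                     w_es: float = 1.0) -> float:
    """Signed running-sum enrichment score of gene_set (S) in a ranked list (L).

    ranked_genes: genes of L, ordered from most up- to most down-regulated.
    scores:       drug-response score d(gl_j) of each gene, in the same order.
    gene_set:     disease gene set S; genes absent from L are ignored.
    w_es:         weight exponent; 0 gives the unweighted signed KS statistic.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_l = len(ranked_genes)
    n_s = sum(in_set)
    if n_s == 0 or n_s == n_l:
        raise ValueError("S must contain some, but not all, of the genes in L")

    # Normalizer N_|d|: sum of |d|^w over the genes of S.
    norm = sum(abs(d) ** w_es for d, hit in zip(scores, in_set) if hit)
    if norm == 0:
        raise ValueError("all S genes have zero score; use w_es=0 instead")
    miss_step = 1.0 / (n_l - n_s)

    p_hit = 0.0
    p_miss = 0.0
    best = 0.0  # positional score with the largest deviation from zero so far
    for d, hit in zip(scores, in_set):
        if hit:
            p_hit += abs(d) ** w_es / norm
        else:
            p_miss += miss_step
        es_i = p_hit - p_miss
        if abs(es_i) > abs(best):
            best = es_i
    return best
```

With a made-up six-gene list `["g1", ..., "g6"]` and scores `[3, 2, 1, -1, -2, -3]`, the set `{"g1", "g2"}` at the top of the list yields an \(ES\) of 1, and the set `{"g5", "g6"}` at the bottom yields an \(ES\) of -1.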

Summary

Figure 3. Connectivity scores vs drug reversal phenotype. The figure shows estimated signs of the different connectivity scores for all eight scenarios corresponding to combinations of up- and down-regulated disease genes (\(S\)) and their relative position on the drug list (\(L\)). The top three scenarios (coded in blue) correspond to favorable outcomes of the drug fully or partially reversing the disease gene signature. The bottom three scenarios (coded in red) correspond to unfavorable outcomes of the drug not reversing the disease gene signature. The middle two scenarios (coded in grey) indicate neutral outcomes.

Figure 4. ES distribution for up- and down-regulated genes. Shown are plots of the running sum of up- and down-regulated gene sets (green curves) of an example liver cancer dataset (GSE84073 (Broutier L 2017)), including the location of the maximum and minimum enrichment scores (top and bottom dashed red lines) and the leading-edge subset (vertical black lines (\(S\) genes in \(L\)) that show up at, or before, the running sum reaches the final enrichment score (the maximum deviation from zero)).

Table 2. GSEA Notations

| Current Notation | Previous Notation | Description |
|---|---|---|
| \(w_{ES}\) | \(p\) | the weight of each step in the enrichment score calculation |
| \(gl_{j}\) | \(g_{j}\) | the gene in \(L\) at index \(j\); \(i,j\) are indices of genes |
| \(d(gl_{j})\) | \(r_{j}\) | the drug-response score of gene \(gl_{j}\) in drug gene list \(L\); this score is used to rank the genes in \(L\) |
| \(N_{\lvert(d)\rvert}\) | \(N_{R}\) | the sum of the absolute drug-response scores (\(d(gl_{j})\)) of every \(L\) gene in \(S\), each weighted by \(w_{ES}\) |
| \(P_{hit}(S,i)\) | \(-\) | the fraction of genes in \(S\) (“hits”) weighted by their drug-response score (\(d(gl_{j})\)) |
| \(P_{miss}(S,i)\) | \(-\) | the fraction of genes not in \(S\) (“misses”) |
| \(N_{L}\) | \(N\) | number of genes in \(L\) |
| \(N_{S}\) | \(N_{H}\) | number of genes in \(S\) |

Connectivity Map 1.0: Disease-Drug Connectivity Scores (CMap 1.0)

The connectivity map 1.0 (CMap 1.0) project pioneered the identification of drug candidates based on their ability to reverse disease gene expression profiles (Justin Lamb 2006). Key to this project was the creation of a large collection of reference gene expression profiles of multiple human cell lines treated with 164 small molecules, including approved drugs. The expression profiles were generated using Affymetrix microarrays. The original CMap 1.0 study and several others focused on cancer (Singh AR 2016), inflammatory bowel disease (Dudley 2011), and skeletal muscle atrophy (Dyle MC 2014) have used this reference library of drug profiles for drug repurposing. In all these cases, the starting point is a disease “signature” defined by the sets of genes up- and down-regulated in the disease. This signature is compared to each drug profile in the reference library using a GSEA-like analysis that results in an enrichment score (\(ES\)) for each of the up- and down-regulated disease gene sets separately. The \(ES\) captures the level and direction of association of the disease gene set with that drug. Then, the ‘up’ and ‘down’ \(ES\) are combined into a single connectivity score (\(CS\)) for the disease with respect to that drug. Finally, for the given disease, drug candidates are identified as those that have a strongly negative \(CS\).

ES Calculation

The drug-disease enrichment score (\(ES\)) in CMap 1.0 is adapted from GSEA. Instead of using GSEA’s signed two-sample KS formulation that compares the positions of \(S\) genes to those of \(L-S\) genes, CMap 1.0 uses a signed one-sample KS statistic that compares the empirical distribution of the positions of \(S\) genes in \(L\) with a reference uniform distribution (of disease genes in the drug gene list):

\[ ES = \left\{ \begin{array}{ll} a & \quad ,if \qquad a > b \\ -b & \quad ,if \qquad b > a \end{array} \right. \] \(\qquad\) where

\[ \displaystyle a = \max_{i=1}^{N_{S}}[\frac{i}{N_{S}}-\frac{idx(L,gs_{i})}{N_{L}}] \] \[ \displaystyle b = \max_{i=1}^{N_{S}}[\frac{idx(L,gs_{i})}{N_{L}}-\frac{(i-1)}{N_{S}}] \]

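Here, the genes \(gs_{i}\) are indexed in increasing order of their position in \(L\), so that \(idx(L,gs_{1}) \leq \dots \leq idx(L,gs_{N_{S}})\). This formulation is used to calculate an \(ES_{up}\) and an \(ES_{down}\) value for the genes up- (\(S_{up}\)) and down-regulated (\(S_{down}\)) by the disease, respectively.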
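The one-sample statistic above can be sketched in Python as follows. The function name `cmap1_enrichment_score` is hypothetical, and two choices are assumptions rather than part of the definition above: the \(S\) genes are visited in increasing order of their position in \(L\), and a tie (\(a = b\)) falls into the \(-b\) branch.

```python
from typing import Sequence, Set


def cmap1_enrichment_score(ranked_genes: Sequence[str],
                           gene_set: Set[str]) -> float:
    """Signed one-sample KS-like enrichment score of gene_set (S) in L."""
    n_l = len(ranked_genes)
    # 1-based positions idx(L, gs_i) of the S genes, in increasing order.
    positions = [pos for pos, gene in enumerate(ranked_genes, start=1)
                 if gene in gene_set]
    n_s = len(positions)
    if n_s == 0:
        raise ValueError("none of the genes in S appear in L")

    # a: largest amount by which the S genes run ahead of a uniform spread.
    a = max(i / n_s - pos / n_l
            for i, pos in enumerate(positions, start=1))
    # b: largest amount by which the S genes lag behind a uniform spread.
    b = max(pos / n_l - (i - 1) / n_s
            for i, pos in enumerate(positions, start=1))
    # Assumption: ties (a == b) are assigned to the -b branch.
    return a if a > b else -b
```

For a made-up list of ten genes, a set occupying positions 1 and 2 gives \(a = 0.8\), \(b = 0.1\) and hence an \(ES\) of about 0.8, whereas a set occupying positions 9 and 10 gives \(a = 0\), \(b = 0.9\) and hence an \(ES\) of about -0.9.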

Connectivity Score (CS) Calculation - Normalization across treatment instances

These two scores are then used to calculate a raw connectivity score \(cs\):

\[ cs = \left\{ \begin{array}{ll} ES_{up}-ES_{down} & \quad ,if \qquad sign(ES_{up}) \neq sign(ES_{down}) \\ 0 & \quad,otherwise \end{array} \right. \]

The final connectivity score is calculated by normalizing the raw score by dividing by the maximum or minimum of raw scores across treatment instances, depending on the sign of \(cs\), bringing it back to range between –1 and +1:

\[ CS = \left\{ \begin{array}{ll} \frac{cs}{max_{t}(cs)} & \quad ,if \qquad cs > 0 \\ \frac{-cs}{min_{t}(cs)} & \quad ,if \qquad cs < 0 \end{array} \right. \]
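The two steps above (raw score, then normalization across treatment instances) can be sketched in Python as follows. The function names are hypothetical, and the handling of a raw score of exactly zero, which is mapped to zero, is an assumption.

```python
from typing import List, Sequence


def _sign(x: float) -> int:
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def raw_connectivity_score(es_up: float, es_down: float) -> float:
    """Raw score cs: ES_up - ES_down if the two signs differ, otherwise 0."""
    if _sign(es_up) != _sign(es_down):
        return es_up - es_down
    return 0.0


def normalize_connectivity_scores(raw: Sequence[float]) -> List[float]:
    """Rescale raw scores across treatment instances to the range [-1, +1]."""
    if not raw:
        return []
    largest, smallest = max(raw), min(raw)
    normalized = []
    for cs in raw:
        if cs > 0:
            normalized.append(cs / largest)    # largest > 0 whenever cs > 0
        elif cs < 0:
            normalized.append(-cs / smallest)  # smallest < 0 whenever cs < 0
        else:
            normalized.append(0.0)             # assumption: zero stays zero
    return normalized


# Made-up (ES_up, ES_down) pairs for four treatment instances.
pairs = [(-0.6, 0.4), (0.5, -0.3), (0.2, 0.1), (-0.2, 0.3)]
raw_scores = [raw_connectivity_score(up, down) for up, down in pairs]
final_scores = normalize_connectivity_scores(raw_scores)
```

In this made-up example the raw scores are approximately -1.0, 0.8, 0.0 and -0.5, which normalize to approximately -1.0, 1.0, 0.0 and -0.5. The first instance, in which the disease up-regulated genes sit at the bottom of the drug profile and the down-regulated genes at the top, receives the most negative score.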

Summary

Table 3. CMap 1.0 Notations

| Current Notation | Previous Notation | Description |
|---|---|---|
| \(CS\) | \(S^{i}\) | connectivity score; normalized connectivity score across all treatment instances |
| \(t\) | \(i\) | treatment instances |
| \(cs\) | \(s^{i}\) | connectivity score for each treatment instance |
| \(ES\) | \(ks\) | enrichment score |
| \(idx(L,gs_{i})\) | \(V(j)\) | position of \(gs_{i}\) in \(L\) |
| \(N_{S}\) | \(t\) | number of genes in \(S\) |
| \(N_{L}\) | \(n\) | number of genes in \(L\) |

Reverse Gene Expression Scores (RGES)

The Connectivity Map project was subsequently expanded into the NIH library of integrated network-based cellular signatures (LINCS) program by using a cost-effective gene-expression assay called L1000 (Aravind Subramanian and Golub 2017). The L1000 platform measures only about 1000 carefully-chosen genes with the rest of the transcriptome estimated by an imputation model trained using publicly available genome-scale expression data (Barrett 2007). The pilot phase of the LINCS program included data for about 20,000 compounds assayed on about 50 human cell lines across a range of doses to result in over one million L1000 profiles.

The focus of the study by Chen et al., 2017 (Bin Chen 2017) was to use this LINCS data to not only capture expression-based drug-disease reversal relationships but also evaluate if these reversals correlate with independently measured drug efficacies. Towards this goal, the authors selected compounds with both efficacy data in ChEMBL (David Mendez 2019) and gene expression data in LINCS. Using these two datasets, this study showed that the distribution of connectivity scores (\(CS\)) from CMap 1.0 (Justin Lamb 2006) is enriched at 0 and that these scores do not correlate well with \(IC_{50}\) values. To address this issue, the authors proposed a new connectivity score called the Reverse Gene Expression Score (\(RGES\)). In CMap 1.0, the connectivity score for a drug is set to zero if \(ES_{up}\) and \(ES_{down}\), the enrichment scores for the up- and down-regulated disease gene sets, have the same sign. \(RGES\), on the other hand, is computed as the difference between the two \(ES\) values regardless of their signs:

\[ RGES = ES_{up}-ES_{down} \]

Summarization of Reverse Gene Expression Score

Since the LINCS dataset contains multiple profiles corresponding to the same drug assayed on multiple cell lines, concentrations, and time points, the study also proposed summarizing a drug’s \(RGES\) values across these various conditions into a single score called the Summarization of Reverse Gene Expression Score (\(sRGES\)). \(sRGES\) is estimated by first setting the condition that corresponds to 10 \(\mu M\) and 24 hours (the most common in the LINCS database) as the ‘reference’ condition and setting all other conditions as ‘target’ conditions. Then, for a specific cell line, a drug’s \(RGES\) in a target condition is assumed to be dependent on the target condition’s dose and time relative to the reference condition, quantified using a heuristic “awarding function” (\(f\)):

\[ f(dose(t),time(t)) = \left\{ \begin{array}{ll} \alpha, \quad dose(t) < 10 \mu M \quad and \quad time(t) < 24 \quad hours \\ \beta, \quad dose(t) < 10 \mu M \quad and \quad time(t) \geq 24 \quad hours \\ \gamma, \quad dose(t) \geq 10 \mu M \quad and \quad time(t) < 24 \quad hours \\ 0, \quad dose(t) \geq 10 \mu M \quad and \quad time(t) \geq 24 \quad hours \end{array} \right. \]

Target conditions are first divided into four groups (as in the equation above), and the value of the function for each target group (e.g. \(dose(t)<10 \mu M\) and \(time(t)<24\) hours) is estimated by averaging the difference in \(RGES\) between the target group and reference group across all the drugs in the reference database that were profiled in the same cell line in that target condition and the reference condition.

Then, to combine \(RGES\) values across cell lines, a weight \(w(t)\) is calculated for each treatment that reflects how similar that treatment’s corresponding cell line, \(cell(t)\), is to the disease under study:

\[ w(t) = \frac{cor(cell(t), disease)}{\max_{k}(cor(cell(k),disease))} \]

Here, \(cor(cell(t), disease)\) is the average of the Spearman correlations between the expression profiles of the cell line \(cell(t)\) and the disease of interest, and \(w(t)\) normalizes this value by the maximum such correlation across all cell lines. Finally, \(sRGES\) is defined as follows:

\[ sRGES = \sum_{t=1}^{N_{d}}\left(RGES(t)+f(dose(t), time(t))\right) \times \frac{w(t)}{N_{d}} \]

This study shows that these new formulations of the connectivity score, \(RGES\) and \(sRGES\), correlate with drug \(IC_{50}\) values: drugs with strongly negative \(RGES\) or \(sRGES\) tend to have low \(IC_{50}\) values.
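The summarization across conditions and cell lines can be sketched in Python as follows. The function names, the tuple layout of `profiles`, and the numeric values for \(\alpha\), \(\beta\), \(\gamma\) and the cell-line correlations are all illustrative assumptions. In practice the awarding-function values are estimated from the reference database as described above, and the sketch assumes the maximum cell-line correlation is positive.

```python
from typing import Dict, Sequence, Tuple

# One profile of a drug: (RGES, dose in micromolar, time in hours, cell line).
Profile = Tuple[float, float, float, str]


def award(dose_um: float, time_h: float,
          alpha: float, beta: float, gamma: float) -> float:
    """Awarding function f relative to the 10 uM / 24 h reference condition."""
    if dose_um < 10 and time_h < 24:
        return alpha
    if dose_um < 10:   # here time_h >= 24
        return beta
    if time_h < 24:    # here dose_um >= 10
        return gamma
    return 0.0


def srges(profiles: Sequence[Profile], cell_cor: Dict[str, float],
          alpha: float, beta: float, gamma: float) -> float:
    """Summarize the RGES values of one drug across its N_d profiles.

    cell_cor maps each cell line to cor(cell, disease); the maximum is taken
    over all cell lines in this mapping and is assumed to be positive.
    """
    max_cor = max(cell_cor.values())
    n_d = len(profiles)
    total = 0.0
    for rges, dose_um, time_h, cell in profiles:
        weight = cell_cor[cell] / max_cor  # w(t)
        total += (rges + award(dose_um, time_h, alpha, beta, gamma)) * weight / n_d
    return total


# Made-up example: two profiles of the same drug in two cell lines.
example = srges(
    profiles=[(-0.4, 10.0, 24.0, "A"), (-0.2, 5.0, 6.0, "B")],
    cell_cor={"A": 0.8, "B": 0.4},
    alpha=-0.1, beta=-0.05, gamma=-0.05,
)
```

In this made-up example the first profile is measured at the reference condition in the most disease-like cell line and contributes \(-0.4 \times 1 / 2\). The second profile contributes \((-0.2 - 0.1) \times 0.5 / 2\), giving an \(sRGES\) of about -0.275.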

Summary

Table 4. \(RGES\) and \(sRGES\) Notations

| Current Notation | Previous Notation | Description |
|---|---|---|
| \(RGES\) | \(-\) | reverse gene expression score |
| \(sRGES\) | \(-\) | summarized reverse gene expression score |
| \(f(dose(t),time(t))\) | \(f(dose(i),time(i))\) | the difference in \(RGES\) between a target condition and reference condition, modeled as a function of dose and time |
| \(cor(cell(t),disease)\) | \(cor(cell(i),disease)\) | the average Spearman correlation between the expression profiles of a cell line \(cell(t)\) and the disease of interest |
| \(ES\) | \(KS\) | enrichment score |
| \(N_{d}\) | \(N\) | number of treatments for a given drug (\(d\)) |
| \(t\) | \(i\) | treatment instances |

CMap 2.0 Connectivity Scores

CMap 2.0 is a massive expansion of the L1000 dataset to ~1.3 million profiles, which represent 42K genetic and small-molecule perturbagens profiled across multiple cell lines (Aravind Subramanian and Golub 2017). As part of the release of this data, the study also proposed new connectivity score calculations (Weighted Connectivity Score, Normalized Connectivity Score, and Tau Score). Similar to the methods outlined above, the CMap 2.0 methodology compares the disease gene set (\(S\)), containing the up- (\(S_{up}\)) and down-regulated (\(S_{down}\)) genes, to reference drug profiles in the L1000 database to get a rank-ordered list of all drugs. The ranking is based on a slightly modified formulation of the connectivity score, along with new proposals for normalizing the scores across cell lines and drug types and for correcting the resulting normalized score against the background of the entire reference library.

Weighted Connectivity Score (\(WCS\))

The disease-drug enrichment score (\(ES\)) in CMap 2.0 is based directly on GSEA’s weighted signed two-sample KS statistic that compares the positions of \(S\) genes to those of \(L-S\) genes with the weight \(w_{ES}\) set to 1. \(ES\) is then used to calculate a Weighted Connectivity Score (\(WCS\)) that represents a non-parametric disease-drug similarity measure. \(WCS\) is defined as follows:

\[ WCS = \left\{ \begin{array}{ll} (ES_{up}-ES_{down})/2 & \quad,if \quad sign(ES_{up}) \ne sign(ES_{down})\\ 0 & \quad, otherwise \end{array} \right. \]

Normalized Connectivity Score (\(NCS\))

The Normalized Connectivity Score (\(NCS\)) was developed to enable the comparison of \(WCS\) across cell lines and drug types. Given the \(WCS\) for a disease in relation to a specific drug of type \(dt\), tested in cell line \(c\), the corresponding \(NCS\) is computed by dividing the \(WCS\) by the absolute mean of the positive or negative \(WCS\) values (whichever matches the sign of the \(WCS\) being normalized) across all the drugs of the same type \(dt\) tested in the same cell line \(c\):

\[ NCS = \left\{ \begin{array}{ll} WCS / \mu_{c,dt}^{+} & \quad,if \quad sign(WCS) > 0 \\ WCS / \mu_{c,dt}^{-} & \quad,otherwise \end{array} \right. \]

Here, \(\mu_{c,dt}^{+}\) and \(\mu_{c,dt}^{-}\) are absolute values of the means of the positive and negative \(WCS\) values, respectively. This procedure is identical to that used in the original GSEA for normalizing \(ES\) scores to make them comparable across gene sets of different sizes.
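The \(WCS\) and \(NCS\) calculations can be sketched together in Python as follows. The function names are hypothetical, and the input to `normalized_connectivity_scores` is assumed to already be restricted to drugs of one type \(dt\) tested in one cell line \(c\).

```python
from typing import List, Sequence


def weighted_connectivity_score(es_up: float, es_down: float) -> float:
    """WCS: (ES_up - ES_down) / 2 if the two signs differ, otherwise 0."""
    sign_up = (es_up > 0) - (es_up < 0)
    sign_down = (es_down > 0) - (es_down < 0)
    if sign_up != sign_down:
        return (es_up - es_down) / 2
    return 0.0


def normalized_connectivity_scores(wcs_values: Sequence[float]) -> List[float]:
    """NCS for every WCS in one (cell line, drug type) group.

    Positive scores are divided by the mean of the positive scores and
    negative scores by the absolute mean of the negative scores.
    """
    positives = [v for v in wcs_values if v > 0]
    negatives = [v for v in wcs_values if v < 0]
    mu_pos = sum(positives) / len(positives) if positives else None
    mu_neg = abs(sum(negatives) / len(negatives)) if negatives else None

    ncs = []
    for v in wcs_values:
        if v > 0:
            ncs.append(v / mu_pos)
        elif v < 0:
            ncs.append(v / mu_neg)
        else:
            ncs.append(0.0)  # a WCS of zero stays zero under either branch
    return ncs
```

For the made-up group of \(WCS\) values 0.4, 0.2, -0.4 and 0.0, the positive mean is 0.3 and the absolute negative mean is 0.4, so the \(NCS\) values are approximately 1.33, 0.67, -1.0 and 0.0.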

Tau scores

Finally, the Normalized Connectivity Score (\(NCS\)) for a given disease-drug pair is converted to a tau (\(\tau\)) score by comparing it to the \(NCS\) values of that disease with all the drugs in the reference database (referred to as “touchstone” in CMap 2.0) of the same type \(dt\) tested in the same cell line \(c\). The result is expressed as a signed percentage between –100 and +100:

\[ \tau = sign(NCS)\frac{100}{N_{D}}\sum_{k=1}^{N_{D}}[|NCS_{k}|<|NCS|] \]

Here, the bracket \([\cdot]\) equals 1 when the enclosed condition holds and 0 otherwise. Thus, a \(\tau\) of 95 indicates that only 5% of drugs in the reference database of the same type and tested in the same cell line (containing \(N_{D}\) drugs) showed stronger connectivity to the disease than the drug of interest. Since any disease is queried against the same fixed drug reference database, \(\tau\) values are comparable across diseases.

Another way to calculate a \(\tau\) score corresponding to the \(NCS\) value for a disease-drug pair is to compare it to the \(NCS\) values of that specific drug with all the perturbation signatures in a reference database. This comparison yields a \(\tau\) that represents the signed percentage of reference signatures that are less connected to the drug than the disease of interest is. In other words, based on this comparison, a \(\tau\) of 95 indicates that only 5% of signatures in the reference database showed stronger connectivity to the drug than the disease of interest. Similarly, \(\tau\) values in this setting are comparable across drugs in the reference database.
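Either variant of the \(\tau\) calculation can be sketched in Python with the same function, since only the choice of reference scores differs. The function name `tau_score` and the reference values below are illustrative assumptions.

```python
from typing import Sequence


def tau_score(ncs: float, reference_ncs: Sequence[float]) -> float:
    """Signed percentage of reference NCS values weaker in magnitude than ncs."""
    n_ref = len(reference_ncs)
    if n_ref == 0:
        raise ValueError("the reference set must not be empty")
    weaker = sum(1 for ref in reference_ncs if abs(ref) < abs(ncs))
    sign = (ncs > 0) - (ncs < 0)
    return sign * 100.0 * weaker / n_ref


# Made-up NCS values of one disease against ten reference drugs.
reference = [-2.0, -1.0, -0.5, 0.2, 0.4, 0.8, 1.5, 1.9, 0.1, -0.3]
tau = tau_score(-1.8, reference)
```

In this made-up example, eight of the ten reference scores are smaller in magnitude than 1.8, so the query \(NCS\) of -1.8 maps to a \(\tau\) of -80.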

Summary

Table 5. CMap 2.0 Notations

| Current Notation | Previous Notation | Description |
|---|---|---|
| \(WCS\) | \(WTCS\); \(w_{c,t}\) | weighted connectivity score; also used to refer to a specific instance of the weighted connectivity score of a given cell line \(c\) and perturbagen type \(dt\) |
| \(c\) | \(-\) | cell line |
| \(dt\) | \(t\) | drug type |
| \(k\) | \(i\) | index of each drug in the reference database; \(k\) = 1,2,3,…,\(N_{D}\) |
| \(\mu^{+}_{c,dt}\), \(\mu^{-}_{c,dt}\) | \(\mu^{+}_{c,t}\), \(\mu^{-}_{c,t}\) | absolute values of means of positive and negative raw weighted connectivity scores, respectively |
| \(N_{D}\) | \(N\) | total number of drug profiles (\(L\)) in the reference database |
| \(S\) | \(q\) | disease gene set (i.e., query) |
| \(L\) | \(r\) | rank-ordered gene list (drug) |

Conclusion

In this review, we have reconciled four key formulations of drug-disease connectivity scores by defining them using consistent notation and terminology. This unified scheme will foster long-term adoption and potential collaboration within the growing computational drug-repurposing community. Our coherent definition of connectivity scores and their relationships will allow researchers to better understand the current state of the art and to express other existing methods using the same notation and terminology. The drug-repurposing community can adopt this consolidated framework to develop, compare, and benchmark new connectivity metrics in the context of existing methods. To facilitate the continuous and transparent integration of newer methods, this review is hosted in a GitHub repository (https://github.com/jravilab/connectivity_score_review) that can be edited by the research community to include new methods for connectivity score calculation. The review document has been written using RMarkdown (Yihui Xie and Grolemund 2018; JJ Allaire and Iannone 2020) and distill (Jones 2018), and is rendered as a living document at https://jravilab.github.io/connectivity_score_review.

Alexandra B. Keenan, Zichen Wang, Megan L. Wojciechowicz, and Avi Ma’ayan. 2019. “Connectivity Mapping: Methods and Applications.” https://doi.org/10.1146/annurev-biodatasci-072018-021211.

Aravind Subramanian, Steven M. Corsello, Rajiv Narayan, and Todd R. Golub. 2017. “A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.” https://doi.org/10.1016/j.cell.2017.10.049.

Aravind Subramanian, Vamsi K. Mootha, Pablo Tamayo, and Jill P. Mesirov. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” https://doi.org/10.1073/pnas.0506580102.

Barrett, Troup, T. 2007. “NCBI GEO: Mining Tens of Millions of Expression Profiles–Database and Tools Update.” https://doi.org/10.1093/nar/gkl887.

Bin Chen, Hyojung Paik, Li Ma. 2017. “Reversal of Cancer Gene Expression Correlates with Drug Efficacy and Reveals Therapeutic Targets.” https://doi.org/10.1038/ncomms16022.

Broutier L, Verstegen MM, Mastrogiovanni G. 2017. “Human Primary Liver Cancer-Derived Organoid Cultures for Disease Modeling and Drug Screening.” https://doi.org/10.1038/nm.4438.

Cheng, Yang, J. 2014. “Systematic Evaluation of Connectivity Map for Disease Indications.” https://doi.org/10.1186/s13073-014-0095-1.

Cheng J., Kumar V., Xie Q. 2013. “Evaluation of Analytical Methods for Connectivity Map Data.” https://doi.org/10.1142/9789814447973_0002.

David Mendez, A Patrícia Bento, Anna Gaulton. 2019. “ChEMBL: Towards Direct Deposition of Bioassay Data.” https://doi.org/10.1093/nar/gky1075.

Dudley, Sirota, J. T. 2011. “Computational Repositioning of the Anticonvulsant Topiramate for Inflammatory Bowel Disease.” https://doi.org/10.1126/scitranslmed.3002648.

Dyle MC, Cook DP, Ebert SM. 2014. “Systems-Based Discovery of Tomatidine as a Natural Small Molecule Inhibitor of Skeletal Muscle Atrophy.” https://doi.org/10.1074/jbc.M114.556241.

Huang, Hsieh, C. T. 2019. “Perturbational Gene-Expression Signatures for Combinatorial Drug Discovery.” https://doi.org/10.1016/j.isci.2019.04.039.

Iskar, Campillos, M. 2010. “Drug-Induced Regulation of Target Expression.” https://doi.org/10.1371/journal.pcbi.1000925.

JJ Allaire, Jonathan McPherson, Yihui Xie, and Richard Iannone. 2020. Rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.

Jones, Nora. 2018. “Distill for R Markdown.” https://rstudio.github.io/distill.

Justin Lamb, David Peck, Emily D Crawford. 2006. “The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease.” https://doi.org/10.1126/science.1132939.

Lin, Li, K. 2019. “A Comprehensive Evaluation of Connectivity Methods for L1000 Data.” https://doi.org/10.1093/bib/bbz129.

Musa, Ghoraie, A. 2018. “A Review of Connectivity Map and Computational Approaches in Pharmacogenomics.” https://doi.org/10.1093/bib/bbw112.

Myles Hollander, Eric Chicken, Douglas A. Wolfe. 1999. Nonparametric Statistical Methods. 3rd ed. Hoboken, New Jersey: John Wiley & Sons, Inc.

Qiaonan Duan, Mario Niepel, Corey Flynn. 2014. “LINCS Canvas Browser: Interactive Web App to Query, Browse and Interrogate LINCS L1000 Gene Expression Signatures.” https://doi.org/10.1093/nar/gku476.

Singh AR, Zulcic M, Joshi S. 2016. “PI-3K Inhibitors Preferentially Target CD15+ Cancer Stem Cell Population in SHH Driven Medulloblastoma.” https://doi.org/10.1371/journal.pone.0150836.

Sirota, Dudley, M. 2011. “Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data.” https://doi.org/10.1126/scitranslmed.3001318.

Vineela Parvathaneni, Aaron Muth, Nishant S. Kulkarni. 2019. “Drug Repurposing: A Promising Tool to Accelerate the Drug Discovery Process.” https://doi.org/10.1016/j.drudis.2019.06.014.

Yihui Xie, J.J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.